Code
library(tidyverse)
library(gt)
library(janitor)
library(gridExtra)
library(plotly)
library(ggpubr)
library(htmltools)library(tidyverse)
library(gt)
library(janitor)
library(gridExtra)
library(plotly)
library(ggpubr)
library(htmltools)The data set was obtained from a convience independent sample of 850 DATA2X02 students, through an online survey. In this report, the students were divided into two groups: those who are currently employed and those who are not.
The question asking about time span is one of the things that requires the most some modification. While the majority of the questions asked for weekly hours spent exercising, working, studying, and so forth there was one question that asked for daily hours spent on social media. This may have went unnoticed by the participants since some replies went past 24 hours.
Furthermore, although some questions enabled participants to pick the ‘other’ option while providing all potential answers, such as ‘How do you prefer your steak cooked?’, others only supplied restricted choices that participants had to choose between, such as the mode of transportation to university. In this case, one frequent option, ‘bikes,’ was removed, forcing participants to leave the question unanswered or select a random response to fill in the multiple choice.
Other modifications include altering the actual questions because they were relatively irrelevant to other factors, such as ‘Pick a random number,’ and focusing on a specific trait (such as their WAM) so that the questions revolves around attributes that may affect them to obtain higher quality data.
The data was imported and cleaned in (R Core Team 2022) before being published in the (Allaire et al. 2022) publishing system. The data wrangling stages heavily relied on the (Wickham et al. 2019) package, which consolidates several other useful packages like ggplot2, dplyr, and readr. Furthermore, (Firke 2021) was used since it improved cleaning efficiency. While some graphs were built using ggplot2 and organised using (gridExtra?), interactive graphs were created and formatted using (plotly?) and (htmltools?). To guarantee that the main variable remained consistent throughout the report, (ggpubr?) was utilized to simply change the colours in the charts.
Because the original column names were in question format, they were reduced using the code given (tarr2022?) to improve code readability. Following that, certain manual checks were performed on data with a high standard deviation. This was owing to the possibility that the big standard deviation was attributable to natural variation, therefore looking at their overall answers to all of the questions influences the choice to maintain it or not. Rows containing absurd responses were eliminated in the event that survey participants intended to be dishonest.
Furthermore, because the study concentrates on the disparities between students’ work status and other attributes, survey responses demonstrating that they had any sort of job were combined together in a new column called ‘work clean’ under the classified as employed. Those who ticked the ‘I don’t currently work’ box were classified as unemployed.
x = readr::read_tsv("data/DATA2x02 survey (2022) - Form responses 1.tsv")
#Changing Column Names
old_names = colnames(x)
new_names = c("timestamp","covid_positive","living_arrangements","height","uni_travel_method","uni_travel_listen","spain_budget","feel_overseas","feel_anxious","study_hrs","read_news","study_load","work","lab_zoom","social_media","gender","sleep_time","wake_time","random_number","steak_preference","dominant_hand","normal_advanced","exercise_hrs","employment_hrs","city","weekly_saving","hourly_plan","weeks_behind","assignment_on_time","used_r_before","team_role","data2x02_hrs","social_media_hrs","uni_year","sport","wam","shoe_size","decade_selection")
# overwrite the old names with the new names:
colnames(x) = new_names
# combine old and new into a data frame:
name_combo = bind_cols(New = new_names, Old = old_names)
#Dropping data
x = x %>% drop_na(hourly_plan)
x = x %>% drop_na(normal_advanced)
x = x %>% filter(wam > 25)
x = x %>% filter(study_hrs <100)
#Cleaning work data
x= x %>% mutate(
work_clean = tolower(work),
work_clean = case_when(str_detect(work_clean, "full time") ~ "Employed",
str_detect(work_clean, "part time") ~ "Employed",
str_detect(work_clean, "casual") ~ "Employed",
str_detect(work_clean, "contractor") ~ "Employed",
str_detect(work_clean, "i don't currently work") ~ "Unemployed")) %>% drop_na()Figure 1 contains interactive graphs that displays the percentage distribution as well as the number of students that submitted their assignments on time. Both graphs were utilised as the percentage bar chart, although is useful for comparison between the employed and unemployed students based on how likely they are to submit assignments on time, can also be deceptive since it appears as though there is an equal likelihood for always handing assignments in on time compared to never handing in the assignment. As a result, having the bar chart on the right reveals that most students always are punctual with assignment deadlines, and there are only a few students who never complete an assignment at all.
A Monte Carlo simulation was used to test the significance of these two categorical variables. The reason is that the ‘never’ and the ‘sometimes’ row had very few counts in Table 1 which makes the expected values in lower than five. Thus, this violates the original chi-sqaure assumptions.
my_palette = c('rgba( 102, 194, 165, 100%)', 'rgba( 251, 141, 98, 100%)')
u_always = filter(x, work_clean == "Unemployed") %>%
filter(assignment_on_time == "Always") %>%
nrow()
u_sometimes = filter(x, work_clean == "Unemployed") %>%
filter(assignment_on_time == "Sometimes") %>%
nrow()
u_usually = filter(x, work_clean == "Unemployed") %>%
filter(assignment_on_time == "Usually") %>%
nrow()
u_never = filter(x, work_clean == "Unemployed") %>%
filter(assignment_on_time == "Never") %>%
nrow()
e_always = filter(x, work_clean == "Employed") %>%
filter(assignment_on_time == "Always") %>%
nrow()
e_sometimes = filter(x, work_clean == "Employed") %>%
filter(assignment_on_time == "Sometimes") %>%
nrow()
e_usually = filter(x, work_clean == "Employed") %>%
filter(assignment_on_time == "Usually") %>%
nrow()
e_never = filter(x, work_clean == "Employed") %>%
filter(assignment_on_time == "Never") %>%
nrow()
column = c("Always", "Usually", "Somtimes", "Never")
e = c(e_always, e_usually, e_sometimes, e_never)
u = c(u_always, u_usually, u_sometimes, u_never)
df_count = data.frame(column, e, u)
df_count$column <- factor(df_count$column, levels = df_count[["column"]])
always_sum = u_always + e_always
usually_sum = u_usually +e_usually
sometimes_sum = u_sometimes +e_sometimes
never_sum = u_never + e_never
sum = c(always_sum, usually_sum, sometimes_sum, never_sum)
e_percent = round(c(e/sum)*100, 2)
u_percent = round(c(u/sum)*100, 2)
df_percent = data.frame(column, e_percent, u_percent)
df_percent$column <- factor(df_percent$column, levels = df_percent[["column"]])
#Stacked bar
fig1 = plot_ly(df_percent,
x = ~column,
y = ~e_percent,
type = 'bar',
name = 'Employed',
marker = list(color = c(my_palette[1])))
fig1 = fig1 %>%
add_trace(y = ~u_percent,
name = 'Unemployed',
marker = list(color = c(my_palette[2])))
fig1 = fig1 %>% layout(yaxis = list(title = 'Percentage (%)'),
xaxis = list(title = "Submission on time"),
barmode = 'stack')
#Grouped Bar
fig2 = plot_ly(df_count,
x = ~column,
y = ~e,
type = 'bar',
name = 'Employed',
marker = list(color = c(my_palette[1])))
fig2 = fig2 %>%
add_trace(y = ~u,
name = 'Unemployed',
marker = list(color = c(my_palette[2])))
fig2 = fig2 %>%
layout(yaxis = list(title = 'Count'),
xaxis = list(title = 'Submission on time'),
barmode = 'group')
browsable(div(style = "width:100%;",
div(style = "width:50%; float: left;", fig1),
div(style = "width:50%; float: right;", fig2)))freq_table = table(x$assignment_on_time, x$work_clean)
freq_table = freq_table[c(1,4,3,2),]
knitr::kable(freq_table)
model = chisq.test(freq_table)
expected = model$expected
knitr::kable(expected, digit = 2)| Employed | Unemployed | |
|---|---|---|
| Always | 48 | 39 |
| Usually | 10 | 11 |
| Sometimes | 2 | 1 |
| Never | 3 | 1 |
| Employed | Unemployed | |
|---|---|---|
| Always | 47.66 | 39.34 |
| Usually | 11.50 | 9.50 |
| Sometimes | 1.64 | 1.36 |
| Never | 2.19 | 1.81 |
#Due to expected counts being lower than 5, I used the yate's chi square test
set.seed(2002)
chisq.test(freq_table, simulate.p.value = T, B = 10000)Hypothesis:
\(H_0:\) Assignments submission punctuality are independent of the student’s current work status.
\(H_1:\) Assignments submission punctuality are not independent of the students’ current work status.
Assumptions: No assumptions about the population distribution is needed.
Test statistic: \(T=\sum_{i=1}^r \sum_{j=1}^c \frac{\left(y_{i j}-e_{i j}\right)^2}{e_{i j}}\)
Observed Test Statistic: 1.271
p-value: The p-value is 0.7806
Decision: As the p-value is bigger than the level of significance at \(\alpha = 0.05\) , the data does not provide evidence against \(H_0\) so we can not reject it. There is evidence that suggest that there is no association between the students’ punctuality with assignment submissions and their work status.
WAM stands for weighted average marks and is updated with the overall mark for each unit depending on the amount of credit points. Figure 2 illustrates how WAM is distributed based on a student’s employment status. According to the density chart, both employed and unemployed students have similar distributions, with the employed students graph shifted to the right of the unemployed students. Furthermore, the box plot indicates that employed students have higher WAM than unemployed students. This is consistent with the Wilcoxon sum ranked test assumption based on two samples.
Additionally, the robustness attribute of the Wilcoxon sum ranked test allows for the existence of some outliers that were included since they were deemed appropriate without significantly affecting the observed test statistic as a t-test would. These are depicted by the red dots on the box plot.
p1 = ggplot(x, aes(x=wam, fill = work_clean)) +
geom_density(alpha = 0.5)+
labs(x= "Weighted Average Score (WAM)",
y= "Count",
fill = "Employment Status") +
scale_fill_brewer(palette = "Set2") +
theme_bw()+
theme(legend.position="bottom",
legend.key.size = unit(0.5, 'cm'),
legend.title = element_text(size=9), legend.text = element_text(size=7))
p2 = ggplot(x, aes(x= work_clean, y=wam, fill = work_clean)) +
geom_boxplot(outlier.colour = 'red') +
scale_fill_brewer(palette="Set2") +
labs(y="Weighted Average Score (WAM)",
x="Employment Status",
fill = "Employment Status") +
theme_bw() +
theme(legend.position="bottom", legend.key.size = unit(0.5, 'cm'),
legend.title = element_text(size=9), legend.text = element_text(size=7))
grid.arrange(p1, p2, ncol = 2) df= x %>%
select(wam, work_clean) %>%
mutate(r = rank(wam)) %>%
arrange(r)
df %>%
group_by(work_clean) %>%
summarise(
w = sum(r),
xbar = mean(wam),
SD = sd(wam),
n = n()
) %>%
knitr::kable(digits = 3, col.names = c("Work Status", "w", "Mean", "SD", "n"))| Work Status | w | Mean | SD | n |
|---|---|---|---|---|
| Employed | 4244 | 76.425 | 10.402 | 63 |
| Unemployed | 2426 | 71.788 | 9.029 | 52 |
e = x %>% filter(work_clean == "Employed")
e_wam = e$wam
u = x %>% filter(work_clean == "Unemployed")
u_wam = u$wam
wilcox.test(e_wam, u_wam, alternative = "greater", correct = F)Hypothesis: Suppose samples \(X_1, X_2,…, X_{n_x}\) and \(Y_1, Y_2,…, Y_{n_y}\) are taken from the employed and unemployed students.
\(H_0: \mu_x = \mu_y\)
\(H_1: \mu_x > \mu_y\)
Assumptions: All data points collected are independent and follow the same distribution that differs by a shift shown in both the box plot and histogram in Figure 2.
Test Statistic: \(W=R_1, +R_2 + … R_{n_x}\). Under \(H_0, W\) ~ \(WRS'(n_X, n_Y)\) distribution, where \(n_x = 63\), \(n_y = 52\), taken from Table 2
Observed test statistic: \(w=r_1 + r_2 +… +r_{n_x} = 4244\)
p-value: \(P(W\ge w) = 0.0004\)
Decision: As the p-value is less than the 0.05 level of signifiance, there is sufficient evidence to reject \(H_0\). Therefore, it can concluded that there is strong enough evidence that WAM from employed and unemployed students are statistically different.
Figure 3 is an interactive histogram that overlays study hours from employed and unemployed students to illustrate study hours distribution. Students who are employed frequently study fewer hours than students who are currently unemployed. Figure 4 was created to test for normality. As seen in the QQ plot, the majority of the points in are rather near to the QQ line. The box plot also appears to show that the distribution is slightly skewed to the right, since the top whisker is considerably longer than the bottom whisker. However, because the sample size of both employed and unemployed students is high (63 employed students and 52 unemployed students seen in Table 3), the central limit theorem (CLT) is applied. According to CLT, regardless of the population distribution, as the sample size increases (usually above n=30), it approaches a normal distribution. As shown in Table 3, the standard deviations are not the same therefore a two sample Welch t-test was used. Moreover, the box plot only showed two outliers so it makes it less likely to potentially cause a large change in the observed test statistic.
ggplotly(ggplot(x, aes(x=study_hrs, fill = work_clean)) +
geom_histogram(alpha=0.7, position="identity", bins = 20) +
scale_fill_brewer(palette = "Set2") +
theme_bw() +
labs(x = "Study hours",
y= "Counts",
fill = "Employment status") +
xlim(0,90))p1 = ggplot(x, aes(sample=study_hrs, color=as.factor(work_clean))) + stat_qq() + stat_qq_line() + labs(y = "Sample Quantiles", x = "Theoretical quantiles", color = "Employment Status") + color_palette("Set2") +
theme(legend.position="bottom",
legend.key.size = unit(0.5, 'cm'),
legend.title = element_text(size=9),
legend.text = element_text(size=7))
p2 = ggplot(x, aes(x= work_clean, y= study_hrs, fill = work_clean)) +
geom_boxplot(outlier.colour = 'red') +
scale_fill_brewer(palette="Set2") +
labs(y="Study hours",
x="Employment Status",
fill = "Employment Status") +
theme_bw() +
theme(legend.position="bottom", legend.key.size = unit(0.5, 'cm'),
legend.title = element_text(size=9), legend.text = element_text(size=7))
grid.arrange(p1, p2, ncol = 2) summary = x %>% select(study_hrs, work_clean) %>% group_by(work_clean) %>%
summarise(n = n(),
Mean = mean(study_hrs),
SD = sd(study_hrs)
)
knitr::kable(summary, digits = 2, col.names = c("Work Status", 'n', "Mean", "SD"))| Work Status | n | Mean | SD |
|---|---|---|---|
| Employed | 63 | 23.59 | 14.08 |
| Unemployed | 52 | 29.00 | 16.01 |
e_study = e$study_hrs
u_study = u$study_hrs
t.test(e_study, u_study)Hypotheses: Let \(X_i\) be the hours of study for the \(i^{th}\) employed student and \(y_j\) the \(j^{th}\) unemployed student. Then \(\mu_x\) and \(\mu_y\) be the study hours population means for employed and unemployed students respectively.
\(H_0: \mu_x= \mu_y\)
\(H_1: \mu_x≠\mu_y\)
Assumptions: Both populations are independent and should be normally distributed. Although, Figure 3 doesn’t show that adequate proof for normality, it can be seen from Figure 4 that both the QQ plot and box plot shows definite proof from the closeness to the QQ line and the symmetry respectively.
Test statistic: \(T=\frac{\bar{X}-\bar{Y}}{\sqrt{\frac{S_X^2}{n_X}+\frac{S_Y^2}{n_Y}}}\) where \(S_e^2\) and \(S_u^2\) are the sample variance of the \(X\) and \(Y\), respectively. Under \(H_0, T \sim t_\nu\) where the degrees of freedom, \(\nu\), is estimated from the data.
Observed test statistic: \(t_0 = -1.905\)
p-value: \(2P(t_{102.54} \ge |-1.905|) = 0.06\)
Decision: As the p-value is greater than \(\alpha =0.05\), we can conclude that the data is consistent with \(H_0\). Thus it appears that there is no evidence that study hours are different between employed and unemployed students.
To summarise, the purpose of this study was to examine a variety of variables to see whether there is a substantial difference between employed and unemployed students. The student’s punctuality with handing in assignments, WAM, and study hours were examined, as it may be assumed that employed students had less time overall to study than unemployed students. However, the three tests resulted in some extremely fascinating conclusions. Most notably, there is evidence that employed and unemployed students spend roughly the same amount of time studying, which can be linked to the decision made with the first question, which was that there was insufficient evidence between a student’s punctuality with handing in assignments and their work status. Furthermore, there is evidence that WAM from working and unemployed students differ statistically. Although the alternative hypothesis cannot be confirmed simply by rejecting the null hypothesis, based on the distribution shown through the charts, employed students are likely to have a higher WAM than unemployed students. However, it should be noted that the survey data might be highly bias, thus can massively influence the results.